"
],
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAeIAAAEWCAYAAAC66pSsAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAd8klEQVR4nO3de7wdZX3v8c+XBA0F5SJoQYUoSgtGQYyoNFqjHrReUOsFU2/U1Fvb9FCrL6npUbxEUaq2jce2SipVNCKoLV5BVNQAVgmXEAh4hVLpURFEpICAv/PHPBsXm72TnWQnTxI+79drvfZaM8/M85vJ5bvmmdkzqSokSVIf2/UuQJKkuzKDWJKkjgxiSZI6MoglSerIIJYkqSODWJKkjgxiaRuR5OIkj+9dx+aU5MgkK3rXsS5JjklyYnu/d5JfJpmxAet5Q5Ljp79C9TSzdwGSpibJL0c+/hZwM3Bb+/zKqnrIZqrjGOBBVfWizdHftqaq/hPYaV3t2peqE6vqfiPLvn0TlqZODGJpK1FVt//nneRy4E+q6ox+Fd01JZlZVbf2rkPbDoempW1EksuTPKm9PybJyUlOTHJ9kouS7Jfkr5P8JMmVSQ4bWXbnJMuS/HeSHyV520RDp0meArwBOKINr17Ypu+V5NQk1yT5XpKXr6XOE5K8P8kX2jrOSvLbSf4uybVJLk3y8JH2Ryf5ftuOS5I8ey3rPi7JirY9U9qmkf11SpKTWj/nJTlw3L59fZJVwA1JZiZ5dJKzk/w8yYWjpwWSPCDJ19q6vgTsPjJvdpJKMrN93i3Jh5Jc1bb/35LsCHwB2Kvto1+2fXz7EHdb9vB2SuLnSc5Msv+4ml+bZFWS69q2zZps36kfg1jadj0D+AiwK3A+cBrDv/n7Am8B/nmk7QnArcCDgIcDhwF/Mn6FVfVF4O3ASVW1U1WNhdXHgf8C9gKeC7w9yRPWUtvzgb9hCKibgXOA89rnU4D3jLT9PvBYYGfgzcCJSfYcXVmS7ZJ8EHgYcFhVXTfVbRrxTOBkYDfgY8C/Jdl+ZP4C4GnALsB9gM8Bb2vtXwt8Mskere3HgJVte94KvHQt/X6E4VTDQ4B7A++tqhuAPwCuavt5p6q6atw27wcsB44C9gA+D3wmyd1Gmj0feArwgLZvjlxLHerEIJa2Xd+oqtPaMOrJDP9ZH1tVtzAE5+wkuyS5D/BU4KiquqGqfgK8F3jBVDpJcn/g94DXV9VNVXUBcDzwkrUs9umqWllVNwGfBm6qqg9X1W3ASQzBCUBVnVxVV1XVr6vqJOC7wCEj69qeIZB2A55RVf+zgdu0sqpOafvnPcAs4NEj8/+hqq6sqhuBFwGfr6rPt7q+BJwLPDXJ3sAjgf9TVTdX1deBz0yy7/ZkCNxXVdW1VXVLVX1tLTWOOgL4XFV9qdX8t8AOwKHjar6qqq5pNRw0xXVrM/IcsbTt+vHI+xuBq1vQjX2G4aKhvRjC7L+TjLXfDrhyiv3sBVxTVdePTLsCmLsetY3/PHo+/CXAa4DZIzXvPtL+QcCBwCFV9as2bR/Wf5tun1dVv04ydoR/p/lt/c9L8oyRadsDX23LXNuOasdcAdx/gj7vz7Dvrl1LXZPZq613tOYrGUY8xvy/kff/wx23R1sIg1jSlQzDw7tP8SKk8Y9suwrYLck9RsJ4b+BHG1tYkn2ADwJPBM6pqtuSXABkpNka4P8CX0jyhKq6jPXfJhgJyiTbAfdj2LYxo9t9JfCRqrrTufBW865JdhwJ4725834bW89uSXapqp+Pm7euR+NdBTx0pN+0bdjo/a7Ny6Fp6S6uqv4bOB14d5J7tvOt+yb5/UkW+THDsPZ2bfkrgbOBdySZleRhwELgxEmWXx87MgTSTwGS/DEwZ4JtWM5wEdkZSfbdgG0CeESSP2wXUR3FEOTfnKTticAzkjw5yYy23Y9Pcr+quoJhmPrNSe6WZB7D+f
o7aXV+AXh/kl2TbJ/kcW32j4F7Jdl5kho+ATwtyRPbuey/ajWfvZZt1BbIIJYEw/ncuwGXANcyXDC15yRtT24/f5bkvPZ+AcPQ8VUM53zfNB2/WlVVlwDvZriY68cMR4BnTdL2XxkuQvtKktms3zYB/DvDeddrgRcDf9jOvU7U15UMF3e9geFLwpXA6/jN/6l/BDwKuAZ4E/DhtfT7YuAW4FLgJwxfAqiqSxnOff+gXRV9h2HlduT/ImApcDVD2D9jZHheW4lUrWv0Q5K2bfEmJerII2JJkjoyiCVJ6sihaUmSOvKIWJKkjvw9Yq233XffvWbPnt27DEnaaqxcufLqqtpjonkGsdbb7NmzOffcc3uXIUlbjSRXTDbPoWlJkjoyiCVJ6sggliSpI4NYkqSODGJJkjoyiCVJ6sggliSpI4NYkqSODGJJkjoyiCVJ6sggliSpI4NYkqSODGJJkjoyiCVJ6sggliSpI4NYkqSODGJJkjoyiCVJ6sggliSpI4NYkqSODGJJkjoyiCVJ6sggliSpI4NYkqSODGJJkjoyiCVJ6sggliSpI4NYkqSODGJJkjoyiCVJ6sggliSpI4NYkqSODGJJkjoyiCVJ6sggliSpI4NYkqSODGJJkjoyiCVJ6sggliSpI4NYkqSODGJJkjoyiCVJ6sggliSpI4NYkqSODGJJkjqa2bsA3bUc+ObTue7GWwC44p1PZ5/XfxaAnXfYngvfdFjP0iSpC4NYm9V1N97C5cc+DYC8k9vfzz76cz3LkqRuHJqWJKkjg1iSpI5SVb1r0FZm7ty5de65527QskkY+zs32XtJ2tYkWVlVcyeat0mPiJO8N8lRI59PS3L8yOd3J3lNksOTHN2mPSvJASNtzkwyYfHj+jouycVJjtuAOg9K8tT1XW5TSHJMktduwHK7JPnTkc97JTllequTpLue5cuXM2fOHGbMmMGcOXNYvnz5tK5/Uw9NnwUcCpBkO2B34CEj8w8Fzq6qU6vq2DbtWcABrL9XAA+rqtdtwLIHAesVxBlsSUP7uwC3B3FVXVVVz+1YjyRt9ZYvX87ixYtZunQpN910E0uXLmXx4sXTGsabOkjOBh7T3j8EWA1cn2TXJHcH9gfOS3JkkvclORQ4HDguyQVJ9m3LPi/Jt5J8J8ljx3eS5FRgJ2BlkiOS7JHkk0m+3V6/19odkuScJOcnOTvJ7yS5G/AW4IjW5xHjj0qTrE4yu70uS/Lhti33T/K61seqJG+eoLYZSU5o67goyV+26fsm+WKSlUm+keR3J1h2wjZJ7pPk00kubK9DgWOBfds2HNdqXd3az0ryodb/+Unmt+lHJvlU6+O7Sd61nn++krRNW7JkCcuWLWP+/Plsv/32zJ8/n2XLlrFkyZJp62OT/vpSVV2V5NYkezMc/Z4D3JchnK8DLqqqXyUZa392C9XPVtUpMJw7BGZW1SFt+PhNwJPG9XN4kl9W1UFtmY8B762qFa3v0xhC/1LgsVV1a5InAW+vquckeSMwt6r+vC1/zFo268HAS6vqm0kOa58PAQKcmuRxVfX1kfYHAfetqjlt3bu06R8AXlVV303yKOD9wBPG9TVZm38AvlZVz04yg+FLyNHAnJF9MHtkPX827KZ6aAvz05PsN1Lfw4GbgcuSLK2qK8dvdJJXMIw6sPfee69l90jStmPNmjXMmzfvDtPmzZvHmjVrpq2PzfF7xGczhPChwHsYgvhQhiA+a4rr+FT7uRKYPYX2TwIOGAt44J5JdgJ2Bv41yYOBArafYv+jrqiqb7b3h7XX+e3zTgzBPBrEPwAemGQp8DmGENyJYR+cPFLj3Uc7WUebJwAvAaiq24Drkuy6lprnAUtb+0uTXAGMBfGXq+q61uclwD7AnYK4qj7A8MWAuXPnelWVpLuE/fffnxUrVjB//vzbp61YsYL9999/2vrYHEE8dp74oQzDuVcCfwX8AvjQFNdxc/t5G1OreTvg0VV10+jEJO8DvtqOJG
cDZ06y/K3ccdh+1sj7G0ZXCbyjqv55skKq6tokBwJPBl4FPB84Cvj52NHrWrZhXW2mw80j76e6fyXpLmHx4sUsXLiQZcuWMW/ePFasWMHChQundWh6c1xsdDbwdOCaqrqtqq5huLDoMW3eeNcD99jIPk8HFo19SDIWZjsDP2rvj1xLn5cDB7dlDwYeMEk/pwEva0evJLlvknuPNkiyO7BdVX0S+Bvg4Kr6BfDDJM9rbdLC+nbraPNl4NVt+owkO0+wDaO+Abywtd8P2Bu4bJK2kqRmwYIFLFmyhEWLFjFr1iwWLVrEkiVLWLBgwbT1sTmC+CKGq6W/OW7adVV19QTtPw68rl1UtO8E86fiL4C57QKqSxiORAHeBbwjyfnc8cjvqwxD2RckOQL4JLBbkouBPwe+M1EnVXU68DHgnCQXAadw5zC8L3BmkguAE4G/btNfCCxMciFwMfDMCbqYrM3/Bua3PlcCB1TVz4Cz2kVh43+F6/3Adq39ScCRVXUzkqR1WrBgAatXr+a2225j9erV0xrC4A09tAE25oYes4/+3G/uNT1yE4/R6ZK0rUmvG3pIkqS1M4glSerIK2S12Y0+8nDs/c47bMhvkknS1s8g1mZ1h/PAx3p9giQ5NC1JUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsSRJHc3sXYA03oFvPp1VxzyZfV7/2dun7bzD9lz4psM6ViVJm4ZBrC3OdTfeAsDlxz7t9mmzj/5cr3IkaZNyaFqSpI4MYkmSOjKI1V2Sdba54p1P3wyVSNLmt9YgTvLeJEeNfD4tyfEjn9+d5DVJDk9ydJv2rCQHjLQ5M8nc6Sg2yRvWMu95SdYk+eoGrHeXJH+6cdVNjySPT/LZdbeccNmjkvzWyOfPJ9ll+qpTD8uXL2fOnDnMmDGDOXPmsHz58t4lSZpG6zoiPgs4FCDJdsDuwENG5h8KnF1Vp1bVsW3as4AD2DQmDWJgIfDyqpq/AevdBVjvIE4yYwP62pSOAm4P4qp6alX9vGM92kjLly9n8eLFLF26lJtuuomlS5eyePFiw1jahqwriM8GHtPePwRYDVyfZNckdwf2B85LcmSS9yU5FDgcOC7JBUn2bcs+L8m3knwnyWMBksxK8qEkFyU5P8n8Nv3IJO8bKyDJZ9tR4rHADm29Hx0tMskbgXnAsiTHJZnRfn47yaokr2ztdkry5STntX6f2VZxLLBvW/dx449K27Yd2d5fnuSdSc5r23VYknPaOk9OstP4nZjkL5Jc0mr5eJu2Y5J/afvl/JFaRpebsE3bvr9Nsrqtc1GSvwD2Ar46NirQat29vX9Na796bJQjyew2ivDBJBcnOT3JDuv4O6HNaMmSJSxbtoz58+ez/fbbM3/+fJYtW8aSJUt6lyZpmqz115eq6q
oktybZm+Ho9xzgvgzhfB1wUVX9auwcX1WdneRU4LNVdQrcfv5vZlUdkuSpwJuAJwF/NixSD03yu8DpSfZbSy1HJ/nzqjpognlvSfIE4LVVdW6SVwDXVdUj2xeGs5KcDlwJPLuqftEC6put3qOBOWPrTvL4dey3n1XVwW0dnwKeVFU3JHk98BrgLePaHw08oKpuHhkqXgx8pape1qZ9K8kZ45abrM1LgNnAQVV1a5LdquqaJK8B5lfV1aMrSfII4I+BRwEB/iPJ14BrgQcDC6rq5Uk+ATwHOHH8Brd9+gqAvffeex27Z/1lCueJ74rWrFnDvHnz7jBt3rx5rFmzplNFkqbbVC7WOpshhMeC+JyRz2dNsZ9PtZ8rGQIEhiPYEwGq6lLgCmDSIF5PhwEvSXIB8B/AvRgCJ8Dbk6wCzmD4UnGfDVj/Se3noxmG4c9qfb0U2GeC9quAjyZ5EXDrSI1Ht+XOBGYB4xNusjZPAv65qm4FqKpr1lHvPODTVXVDVf2S4c/jsW3eD6vqgvZ+9M/nDqrqA1U1t6rm7rHHHuvobv1V1e2v0Rt53NXtv//+rFix4g7TVqxYwf7779+pIknTbSo39Bg7T/xQhqHpK4G/An4BfGiK/dzcft42hT5v5Y5fEGZNsY9RARZV1Wl3mDgML+8BPKKqbkly+STrX1cNN4z086WqWrCOep4GPA54BrA4yUPbss+pqsvG1Tj6xWCyNuvobr3cPPL+NsCh6S3I4sWLWbhwIcuWLWPevHmsWLGChQsXOjQtbUOmekT8dOCaqrqtHX3twjA8ffYE7a8H7jGF9X4DeCFAG5LeG7gMuBw4KMl2Se4PHDKyzC1Jtp/Cuk8DXj3WNsl+SXYEdgZ+0kJ4Pr85eh1f8xXAAUnu3oaEnzhJP98Efi/Jg1o/O44fXs9wkdv9q+qrwOtbDTu1GhelpWqSh0+yHRO1+RLwyiQz2/TdJtmOMd8AnpXkt9p+eHabpi3cggULWLJkCYsWLWLWrFksWrSIJUuWsGDBur77SdpaTOWI+CKGq6U/Nm7aTuPPRTYfBz7YLh567lrW+37gH5NcxHAEemQ7h3oW8EPgEmANcN7IMh8AViU5r6peuJZ1H88wxHpeC7GfMlzN/VHgM63Pc4FLAarqZ0nOSrIa+EJVva6dL13dajl/ok6q6qftKHt5OxcN8DfAd0aazQBOTLIzwxHuP1TVz5O8Ffi7tj3btX7G/7LsZG2OZxjGX5XkFuCDwPva/vlikqtGrx6vqvOSnAB8a2z/VNX5SWavZR9qC7FgwQKDV9qGpap616CtzNy5c+vcc8+dtvUlYfTv4eyjP8cV73z6HaaNbyNJW5MkK6tqwntqeGctdTeVgPUCLknbKoNYkqSODGJJkjryecTaYo0+g3jnHaZysbwkbX0MYm1xLj/2aXCsF2ZJumtwaFqSpI4MYkmSOjKIJUnqyCCWJKkjg1iSpI4MYkmSOjKIJUnqyCCWJKkjg1iSpI4MYkmSOjKIJUnqyCCWJKkjg1iSpI4MYkmSOjKIJUnqyCCWJKkjg1iSpI4MYkmSOjKIJUnqyCCWJKkjg1iSpI4MYkmSOjKIJUnqyCCWJKkjg1iSpI4MYkmSOjKIJUnqyCCWJKkjg1iSpI4MYkmSOjKIJUnqyCCWJKkjg1iSpI4MYkmSOjKIJUnqyCCWJKkjg1iSpI4MYkmSOjKIJUnqyCCWJKkjg1iSpI4MYkmSOjKIJUnqyCCWJKmjVFXvGrSVSfJT4IredYyzO3B17yI2gHVvXta9eVn3b+xTVXtMNMMg1jYhyblVNbd3HevLujcv6968rHtqHJqWJKkjg1iSpI4MYm0rPtC7gA1k3ZuXdW9e1j0FniOWJKkjj4glSerIIJYkqSODWFukJE9JclmS7yU5eoL5j0tyXpJbkzx33LyXJvlue710ZPoXk1yY5OIk/5RkxtZQ98j8U5Os3hpqTnJmW+cF7XXvraTuuyX5QJLvJLk0yXO29LqT3G
NkP1+Q5Ookf7el192mL0hyUZJV7d/n7ltJ3Ue0mi9O8s6NLrKqfPnaol7ADOD7wAOBuwEXAgeMazMbeBjwYeC5I9N3A37Qfu7a3u/a5t2z/QzwSeAFW0Pdbf4fAh8DVm8NNQNnAnO3wr8jbwbe1t5vB+y+NdQ9bvmVwOO29LqBmcBPxvYx8C7gmK2g7nsB/wns0dr9K/DEjanTI2JtiQ4BvldVP6iqXwEfB5452qCqLq+qVcCvxy37ZOBLVXVNVV0LfAl4SlvmF63NTIZ/lNN9peImqTvJTsBrgLdNc72brObNYFPV/TLgHW35X1fVdN9daZPu7yT7AfcGvrEV1J322jFJgHsCV20FdT8Q+G5V/bS1OwPYqJETg1hbovsCV458/q82baOXTXIaw7fw64FTNq7M9et7I5Z9K/Bu4H82tsD17Hdjl/1QGyr9P+0/2uk07XUn2aV9fmsbqjw5yX02vtR19z2Ny74AOKnaodo0mva6q+oW4NXARQwBfACwbONLXXffG7ns94DfSTI7yUzgWcD9N6ZIg1h3KVX1ZGBP4O7AEzqXs05JDgL2rapP965lPb2wqh4KPLa9Xty5nqmYCdwPOLuqDgbOAf62b0nr7QXA8t5FTEWS7RmC+OHAXsAq4K+7FjUF7ej41cBJDCMPlwO3bcw6DWJtiX7EHb9h3q9Nm5Zlq+om4N8ZN0Q1DTZF3Y8B5ia5HFgB7JfkzI2udN39btSyVTX283qGc9uHbHSlU+x7I5b9GcOow6fa9JOBgzeuzCn3vdHLJjkQmFlVKze2yPXtewOXPQigqr7fjuA/ARy68aVOqe+NWraqPlNVj6qqxwCXAd/ZmCINYm2Jvg08OMkDktyN4Vv+qVNc9jTgsCS7JtkVOAw4LclOSfYEaMNJTwMu3dLrrqp/rKq9qmo2MA/4TlU9fkuuOcnMsatf21HP04Hpvtp7U+zrAj4DPL61eyJwyfSWPf11j8xfwKY7Gt4Udf8IOCDJ2BOJ/hewZiuom7TfAmjT/xQ4fqOqnM4r1Hz5mq4X8FSGb5nfBxa3aW8BDm/vH8lwzuYGhiOZi0eWfRnDeZzvAX/cpt2H4R/lKoZQWMpw9LBF1z1u3bOZ5qumN9G+3pHhyt1VwMXA3wMztvS62/R9gK+32r8M7L011N3m/QD43emudxPv71cxhO8qhi9B99pK6l7O8CXtEqbhty+8xaUkSR05NC1JUkcGsSRJHRnEkiR1ZBBLktSRQSxJUkcGsaQpS3LbuCf9HN2mn5lkbod6Thj/xJw2/cgke418Pj7JAZug/x2SfC1reZJXkjPa75tKE5rZuwBJW5Ubq+qg3kVMwZEMvy9+FUBV/ckm6udlwKeqam23OPwIw00flmyiGrSV84hY0rRK8o9Jzm3Pan3zyPTLk7yrPX/2W0ke1KY/L8nqDM+K/nqbNiPJcUm+3Z77+so2PUne154vewbDk4bG9/9cYC7w0XbUvsPoEXuSX7Z1X9yOVg9p83+Q5PC19T+BFzLcLpUkeyb5eutzdZLHtjanMtz1SpqQQSxpfewwbmj6iAnaLK6quQzPeP39JA8bmXddDQ+DeB8w9vD6NwJPrqoDgcPbtIWt7SMZ7nz08iQPAJ4N/A7Dk3pewgT3Jq6qU4BzGR48cVBV3TiuyY7AV6rqIQxP4Xobw+0Vn81wx6W19X+7dsvEB1bV5W3SHzHcKvMg4EDgglbPtcDdk9xrgn0lOTQtab1MZWj6+UlewfD/y54MobmqzVs+8vO97f1ZwAlJPsFvHrhwGPCwkfO/OwMPBh4HLG9DwVcl+coGbMOvgC+29xcBN1fVLUkuYriN6Nr6/+HIenYHfj7y+dvAv7T7a/9bVV0wMu8nDE8Y+tkG1KttnEEsadq0o8bXAo+sqmuTnADMGmlS499X1auSPIrhQRwrkzyC4YHxi6pq9KEGJHnqNJR5S/3m3r6/Bm5udfy6PRCEyfof50ZGtq2qvp7kcW07Tkjynqr6cJs9q7
WX7sShaUnT6Z4MN8+/Lsl9gD8YN/+IkZ/nACTZt6r+o6reCPyU4dFzpwGvbkeXJNkvyY4MD2Q4op3D3ROYP0kd1wP32IjtmKz/27Uh5xlJZrU2+wA/rqoPMjyN5+A2PcBvMzy3VroTj4glrY8dkowOuX6xqo4e+1BVFyY5n+ERk1cyDDuP2jXJKoaj0LELmI5L8mCGo9AvAxcyDGXPBs5rQfZT4FnAp4EnMDz15j9pYT6BE4B/SnIjwzOd19fxk/Q/3ukMj6c8g+Hxia9LcgvwS4Zz2ACPAL5ZVbduQB26C/DpS5I2iySXA3Or6uretUyXJAcDf1lVL15Lm78HTq2qL2++yrQ1cWhakjZQVZ0HfHVtN/RgeIa0IaxJeUQsSVJHHhFLktSRQSxJUkcGsSRJHRnEkiR1ZBBLktTR/wfsmwc6WQYL2wAAAABJRU5ErkJggg==\n"
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"source": [
"We can draw the same conclusions for both training and scoring elapsed time: selecting the most informative features speed-up our pipeline. Of course, such speed-up is beneficial only if the generalization performance in terms of metrics remain the same. Let’s check the testing score."
],
"metadata": {
"id": "5tHmNZ85gmPa"
}
},
{
"cell_type": "code",
"source": [
"cv_results[\"test_score\"].plot.box(color=color, vert=False)\n",
"plt.xlabel(\"Accuracy score\")\n",
"_ = plt.title(\"Test score via cross-validation\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 295
},
"id": "4onQdLXGgfJn",
"outputId": "3ff34056-5a48-4bb1-abdb-ab2b5ada7f8c"
},
"execution_count": 18,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"
"
],
"image/png": "\n"
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"source": [
"We can observe that the model’s generalization performance selecting a subset of features decreases compared with the model using all available features. Since we generated the dataset, we can infer that the decrease is because of the selection. The feature selection algorithm did not choose the two informative features."
],
"metadata": {
"id": "n_Ym3R87g4vw"
}
},
{
"cell_type": "code",
"source": [
"for idx, pipeline in enumerate(cv_results_with_selection[\"estimator\"]):\n",
" print(\n",
" f\"Fold #{idx} - features selected are: \"\n",
" f\"{np.argsort(pipeline[0].scores_)[-2:]}\"\n",
" )"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "mTUZGkUPgowf",
"outputId": "87061117-1f82-4759-ac85-ef5ec8385e56"
},
"execution_count": 19,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Fold #0 - features selected are: [89 53]\n",
"Fold #1 - features selected are: [29 53]\n",
"Fold #2 - features selected are: [52 53]\n",
"Fold #3 - features selected are: [49 53]\n",
"Fold #4 - features selected are: [49 53]\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"We see that the feature 53 is always selected while the other feature varies depending on the cross-validation fold.\n",
"\n",
"If we would like to keep our score with similar generalization performance, **we could choose another metric to perform the test or select more features.** For instance, we could select the number of features based on a specific percentile of the highest scores."
],
"metadata": {
"id": "RgTtgRS5hBBJ"
}
},
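{
"cell_type": "markdown",
"source": [
"As a minimal sketch of the percentile-based idea above, `SelectPercentile` keeps the features whose scores fall in the top given percentile. The synthetic dataset, the `_demo` variable names, the ANOVA F-score (`f_classif`) and the value `percentile=10` are assumptions made for illustration; they are not necessarily the data or settings used earlier in this notebook."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"from sklearn.datasets import make_classification\n",
"from sklearn.feature_selection import SelectPercentile, f_classif\n",
"\n",
"# Hypothetical synthetic data for illustration only (not the notebook's dataset)\n",
"X_demo, y_demo = make_classification(\n",
"    n_samples=500, n_features=100, n_informative=2, n_redundant=0, random_state=0\n",
")\n",
"\n",
"# Keep the features whose ANOVA F-scores are in the top 10% (assumed value)\n",
"selector_demo = SelectPercentile(score_func=f_classif, percentile=10)\n",
"X_demo_selected = selector_demo.fit_transform(X_demo, y_demo)\n",
"\n",
"print(f'{X_demo.shape[1]} features before, {X_demo_selected.shape[1]} after selection')"
],
"metadata": {},
"execution_count": null,
"outputs": []
},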
{
"cell_type": "markdown",
"source": [
"#### Mutual information"
],
"metadata": {
"id": "HiK_yc8kk4GI"
}
},
{
"cell_type": "markdown",
"source": [
"The [*Automobile*](https://www.kaggle.com/toramky/automobile-dataset) dataset consists of 193 cars from the 1985 model year. The goal for this dataset is to predict a car's `price` (the target) from 23 of the car's features, such as `make`, `body_style`, and `horsepower`. In this example, we'll rank the features with mutual information and investigate the results by data visualization. (The original dataset requires data cleaning, you could refer to https://skill-lync.com/student-projects/project-1-1299)"
],
"metadata": {
"id": "k-V69-wYlCxA"
}
},
{
"cell_type": "code",
"source": [
"df = pd.read_csv(\"autos.csv\")\n",
"df.head()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 352
},
"id": "lg40S0_0llto",
"outputId": "d7663ccb-de86-4e56-b1d5-d4c0e5287c48"
},
"execution_count": 33,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" symboling make fuel_type aspiration num_of_doors body_style \\\n",
"0 3 alfa-romero gas std 2 convertible \n",
"1 3 alfa-romero gas std 2 convertible \n",
"2 1 alfa-romero gas std 2 hatchback \n",
"3 2 audi gas std 4 sedan \n",
"4 2 audi gas std 4 sedan \n",
"\n",
" drive_wheels engine_location wheel_base length ... engine_size \\\n",
"0 rwd front 88.6 168.8 ... 130 \n",
"1 rwd front 88.6 168.8 ... 130 \n",
"2 rwd front 94.5 171.2 ... 152 \n",
"3 fwd front 99.8 176.6 ... 109 \n",
"4 4wd front 99.4 176.6 ... 136 \n",
"\n",
" fuel_system bore stroke compression_ratio horsepower peak_rpm city_mpg \\\n",
"0 mpfi 3.47 2.68 9 111 5000 21 \n",
"1 mpfi 3.47 2.68 9 111 5000 21 \n",
"2 mpfi 2.68 3.47 9 154 5000 19 \n",
"3 mpfi 3.19 3.40 10 102 5500 24 \n",
"4 mpfi 3.19 3.40 8 115 5500 18 \n",
"\n",
" highway_mpg price \n",
"0 27 13495 \n",
"1 27 16500 \n",
"2 26 16500 \n",
"3 30 13950 \n",
"4 22 17450 \n",
"\n",
"[5 rows x 25 columns]"
]
},
"metadata": {},
"execution_count": 33
}
]
},
{
"cell_type": "markdown",
"source": [
"The scikit-learn algorithm for MI treats discrete features differently from continuous features. Consequently, you need to tell it which are which. As a rule of thumb, anything that *must* have a `float` dtype is *not* discrete. Categoricals (`object` or `categorial` dtype) can be treated as discrete by giving them a label encoding"
],
"metadata": {
"id": "YX3VJQKJl4hf"
}
},
{
"cell_type": "code",
"source": [
"X = df.copy()\n",
"y = X.pop(\"price\")\n",
"\n",
"# Label encoding for categoricals\n",
"for colname in X.select_dtypes(\"object\"):\n",
" X[colname], _ = X[colname].factorize()\n",
"\n",
"# All discrete features should now have integer dtypes (double-check this before using MI!)\n",
"discrete_features = X.dtypes == int"
],
"metadata": {
"id": "VOVazcEFlv6Q"
},
"execution_count": 34,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Scikit-learn has two mutual information metrics in its `feature_selection` module: one for real-valued targets (`mutual_info_regression`) and one for categorical targets (`mutual_info_classif`). Our target, `price`, is real-valued. The next cell computes the MI scores for our features and wraps them up in a nice dataframe."
],
"metadata": {
"id": "SgQfNAYNmGhw"
}
},
{
"cell_type": "code",
"source": [
"def make_mi_scores(X, y, discrete_features):\n",
" mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)\n",
" mi_scores = pd.Series(mi_scores, name=\"MI Scores\", index=X.columns)\n",
" mi_scores = mi_scores.sort_values(ascending=False)\n",
" return mi_scores\n",
"\n",
"mi_scores = make_mi_scores(X, y, discrete_features)\n",
"mi_scores[::3] # show a few features with their MI scores"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "36ZK4lJNmDZi",
"outputId": "b67fcb48-2df8-4f98-f0c9-397a177f06f5"
},
"execution_count": 35,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"curb_weight 1.552832\n",
"highway_mpg 0.959290\n",
"length 0.615354\n",
"bore 0.504682\n",
"stroke 0.391373\n",
"num_of_cylinders 0.330589\n",
"compression_ratio 0.134892\n",
"fuel_type 0.047279\n",
"Name: MI Scores, dtype: float64"
]
},
"metadata": {},
"execution_count": 35
}
]
},
{
"cell_type": "code",
"source": [
"def plot_mi_scores(scores):\n",
" scores = scores.sort_values(ascending=True)\n",
" width = np.arange(len(scores))\n",
" ticks = list(scores.index)\n",
" plt.barh(width, scores)\n",
" plt.yticks(width, ticks)\n",
" plt.title(\"Mutual Information Scores\")\n",
"\n",
"\n",
"plt.figure(dpi=100, figsize=(8, 5))\n",
"plot_mi_scores(mi_scores)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 460
},
"id": "Vi9tYOlEmKvL",
"outputId": "c741f89c-0c60-4f4d-fa84-d02500098898"
},
"execution_count": 36,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"
"
],
"image/png": "\n"
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"source": [
"The `fuel_type` feature has a fairly low MI score, but as we can see from the figure, it clearly separates two `price` populations with different trends within the `horsepower` feature. **This indicates that `fuel_type` contributes an interaction effect and might not be unimportant after all.** Before deciding a feature is unimportant from its MI score, it's good to investigate any possible interaction effects -- domain knowledge can offer a lot of guidance here."
],
"metadata": {
"id": "A34J5fAZpNJO"
}
},
{
"cell_type": "code",
"source": [
"sns.lmplot(x=\"horsepower\", y=\"price\", hue=\"fuel_type\", data=df);"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 369
},
"id": "TCwOeNdepGUP",
"outputId": "7255762f-7168-44a9-cd38-6c578861b4c1"
},
"execution_count": 38,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"
"
],
"image/png": "\n"
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"source": [
"### Sequential feature selection"
],
"metadata": {
"id": "o0rWGFi9pvq4"
}
},
{
"cell_type": "markdown",
"source": [
"Sequential Feature Selection is available in the `SequentialFeatureSelector` transformer. SFS can be either forward or backward:\n",
"\n",
"* Forward-SFS is a greedy procedure that iteratively finds the best new feature to add to the set of selected features. Concretely, we initially start with zero feature and find the one feature that maximizes a cross-validated score when an estimator is trained on this single feature. Once that first feature is selected, we repeat the procedure by adding a new feature to the set of selected features. The procedure stops when the desired number of selected features is reached, as determined by the n_features_to_select parameter.\n",
"\n",
"* Backward-SFS follows the same idea but works in the opposite direction: instead of starting with no feature and greedily adding features, we start with all the features and greedily remove features from the set. The direction parameter controls whether forward or backward SFS is used.\n",
"\n",
"In general, forward and backward selection do not yield equivalent results. Also, one may be much faster than the other depending on the requested number of selected features: if we have 10 features and ask for 7 selected features, forward selection would need to perform 7 iterations while backward selection would only need to perform 3."
],
"metadata": {
"id": "wAQBoFEVp1X1"
}
},
{
"cell_type": "code",
"source": [
"X, y = load_iris(return_X_y=True)\n",
"knn = KNeighborsClassifier(n_neighbors=3)"
],
"metadata": {
"id": "XN-VAbTRpV_I"
},
"execution_count": 41,
"outputs": []
},
{
"cell_type": "code",
"source": [
"sfs = SequentialFeatureSelector(knn, n_features_to_select=3)\n",
"sfs.fit(X, y)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "FWjzoHnk-Orv",
"outputId": "2d761371-31d3-4f4b-ed23-a9b2c9e865e7"
},
"execution_count": 42,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"SequentialFeatureSelector(estimator=KNeighborsClassifier(n_neighbors=3),\n",
" n_features_to_select=3)"
]
},
"metadata": {},
"execution_count": 42
}
]
},
{
"cell_type": "code",
"source": [
"sfs.transform(X).shape"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "8pYRsVKX-nk8",
"outputId": "06c63564-28e1-4912-9a03-e99e64660ed7"
},
"execution_count": 43,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(150, 3)"
]
},
"metadata": {},
"execution_count": 43
}
]
},
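{
"cell_type": "markdown",
"source": [
"The cells above rely on the default forward direction. As a hedged illustration of the `direction` parameter described earlier, the sketch below fits both variants on the iris data and prints the resulting support masks. The `_demo` and `_iris` names are introduced here only for illustration, and the two directions are not guaranteed to select the same features."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"from sklearn.datasets import load_iris\n",
"from sklearn.feature_selection import SequentialFeatureSelector\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"X_iris, y_iris = load_iris(return_X_y=True)\n",
"knn_demo = KNeighborsClassifier(n_neighbors=3)\n",
"\n",
"# Fit one selector per direction and show which of the 4 features are kept\n",
"for direction in ('forward', 'backward'):\n",
"    sfs_demo = SequentialFeatureSelector(\n",
"        knn_demo, n_features_to_select=3, direction=direction\n",
"    )\n",
"    sfs_demo.fit(X_iris, y_iris)\n",
"    print(direction, sfs_demo.get_support())"
],
"metadata": {},
"execution_count": null,
"outputs": []
},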
{
"cell_type": "markdown",
"source": [
"### Feature selection from model"
],
"metadata": {
"id": "Wkh5pr70-sCN"
}
},
{
"cell_type": "markdown",
"source": [
"`SelectFromModel` is a meta-transformer that can be used alongside any estimator that assigns importance to each feature through a specific attribute (such as `coef_`, `feature_importances_`) or via an `importance_getter` callable after fitting. The features are considered unimportant and removed if the corresponding importance of the feature values are below the provided threshold parameter. \n",
"\n",
"Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are “mean”, “median” and float multiples of these like “0.1*mean”. In combination with the threshold criteria, one can use the `max_features` parameter to set a limit on the number of features to select."
],
"metadata": {
"id": "cAf7ARk0Howp"
}
},
{
"cell_type": "code",
"source": [
"X, y = load_iris(return_X_y=True)"
],
"metadata": {
"id": "gkWQgTpd-pwt"
},
"execution_count": 45,
"outputs": []
},
{
"cell_type": "code",
"source": [
"lsvc = LinearSVC(C=0.01, penalty=\"l1\", dual=False).fit(X, y)"
],
"metadata": {
"id": "NvnykC0WJC4_"
},
"execution_count": 48,
"outputs": []
},
{
"cell_type": "code",
"source": [
"model = SelectFromModel(lsvc, prefit=True)"
],
"metadata": {
"id": "jsBkMzF4JFAb"
},
"execution_count": 49,
"outputs": []
},
{
"cell_type": "code",
"source": [
"X_new = model.transform(X)\n",
"X_new.shape"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "WbuJtKY2JFRg",
"outputId": "d9739206-61be-4a02-dec1-4e5dc3957a57"
},
"execution_count": 50,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(150, 3)"
]
},
"metadata": {},
"execution_count": 50
}
]
},
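{
"cell_type": "markdown",
"source": [
"The string heuristics and the `max_features` cap mentioned above can be sketched as follows. The random forest is an assumed estimator chosen only because it exposes `feature_importances_`; the `_demo` and `_iris` names and the values `threshold='median'` and `max_features=2` are likewise illustrative choices, not settings taken from this notebook."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"from sklearn.datasets import load_iris\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.feature_selection import SelectFromModel\n",
"\n",
"X_iris, y_iris = load_iris(return_X_y=True)\n",
"\n",
"# Assumed estimator: any model exposing feature_importances_ or coef_ would do\n",
"forest_demo = RandomForestClassifier(n_estimators=100, random_state=0)\n",
"\n",
"# Keep features at or above the median importance, capped at two features\n",
"selector_demo = SelectFromModel(forest_demo, threshold='median', max_features=2)\n",
"X_iris_selected = selector_demo.fit_transform(X_iris, y_iris)\n",
"\n",
"print(selector_demo.get_support(), X_iris_selected.shape)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},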
{
"cell_type": "markdown",
"source": [
"### A Concret example"
],
"metadata": {
"id": "3zP7dVWRLgmn"
}
},
{
"cell_type": "markdown",
"source": [
"The following dataset is a record of neighborhoods in California district, predicting the median house value (target) given some information about the neighborhoods, as the average number of rooms, the latitude, the longitude or the median income of people in the neighborhoods (block)."
],
"metadata": {
"id": "k8srd8jgJZ94"
}
},
{
"cell_type": "code",
"source": [
"X, y = fetch_california_housing(as_frame=True, return_X_y=True)"
],
"metadata": {
"id": "qT8aM1mqJN3J"
},
"execution_count": 52,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# To speed up the computation, we take the first 10000 samples\n",
"X = X[:10000]\n",
"y = y[:10000]\n",
"X.head()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "nUIwkfs4JlWO",
"outputId": "ad99d5a2-70ed-47c2-d5be-11aa243806b2"
},
"execution_count": 53,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \\\n",
"0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 \n",
"1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 \n",
"2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 \n",
"3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 \n",
"4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 \n",
"\n",
" Longitude \n",
"0 -122.23 \n",
"1 -122.22 \n",
"2 -122.24 \n",
"3 -122.25 \n",
"4 -122.25 "
]
},
"metadata": {},
"execution_count": 53
}
]
},
{
"cell_type": "markdown",
"source": [
"The feature reads as follow:\n",
"\n",
"* MedInc: median income in block\n",
"* HouseAge: median house age in block\n",
"* AveRooms: average number of rooms\n",
"* AveBedrms: average number of bedrooms\n",
"* Population: block population\n",
"* AveOccup: average house occupancy\n",
"* Latitude: house block latitude\n",
"* Longitude: house block longitude\n",
"* MedHouseVal: Median house value in 100k$ (target)\n",
"\n",
"To assert the quality of our inspection technique, let’s add some random feature that won’t help the prediction (un-informative feature)"
],
"metadata": {
"id": "8i7IQeucJsWY"
}
},
{
"cell_type": "code",
"source": [
"# Adding random features\n",
"rng = np.random.RandomState(0)\n",
"bin_var = pd.Series(rng.randint(0, 1, X.shape[0]), name='rnd_bin')\n",
"num_var = pd.Series(np.arange(X.shape[0]), name='rnd_num')\n",
"X_with_rnd_feat = pd.concat((X, bin_var, num_var), axis=1)"
],
"metadata": {
"id": "SLi9SoU6JqWe"
},
"execution_count": 54,
"outputs": []
},
{
"cell_type": "code",
"source": [
"X_train, X_test, y_train, y_test = train_test_split(X_with_rnd_feat, y, random_state=42)"
],
"metadata": {
"id": "OjVGGwhmJz91"
},
"execution_count": 57,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"In linear models, the target value is modeled as a linear combination of the features."
],
"metadata": {
"id": "m8Byb20-KF5I"
}
},
{
"cell_type": "code",
"source": [
"model = RidgeCV()\n",
"\n",
"model.fit(X_train, y_train)\n",
"\n",
"print(f'model score on training data: {model.score(X_train, y_train)}')\n",
"print(f'model score on testing data: {model.score(X_test, y_test)}')"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "CVQ3s5uuJ83F",
"outputId": "829c716a-43ca-4701-d474-d7bc862f15c8"
},
"execution_count": 60,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"model score on training data: 0.6048814128047645\n",
"model score on testing data: 0.5866391379089506\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"Our linear model obtains a $R^2$ score of .60, so it explains a significant part of the target. Its coefficient should be somehow relevant. Let’s look at the coefficient learnt"
],
"metadata": {
"id": "41HBpjFYKe4a"
}
},
{
"cell_type": "code",
"source": [
"coefs = pd.DataFrame(\n",
" model.coef_,\n",
" columns=['Coefficients'], index=X_train.columns\n",
")\n",
"\n",
"coefs.plot(kind='barh', figsize=(9, 7))\n",
"plt.title('Ridge model')\n",
"plt.axvline(x=0, color='.5')\n",
"plt.subplots_adjust(left=.3)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 444
},
"id": "tdWavBbTKZxa",
"outputId": "5f300833-73c9-463b-9666-95a1863b9925"
},
"execution_count": 61,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"
"
],
"image/png": "\n"
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"source": [
"The `AveBedrms` have the higher coefficient. However, we can’t compare the magnitude of these coefficients directly, since they are not scaled. Indeed, `Population` is an integer which can be thousands, while `AveBedrms` is around 4 and `Latitude` is in degree.\n",
"\n",
"So the Population coefficient is expressed in `“100k$/habitant”` while the `AveBedrms` is expressed in `“100k$/nb of bedrooms”` and the Latitude coefficient in `“100k$/degree”`. We see that changing population by one does not change the outcome, while as we go south (latitude increase) the price becomes cheaper. Also, adding a bedroom (keeping all other feature constant) shall rise the price of the house by `80k$`.\n",
"\n",
"So looking at the coefficient plot to gauge feature importance can be misleading as some of them vary on a small scale, while others vary a lot more, several decades. So before any interpretation, we need to scale each column (removing the mean and scaling the variance to 1)."
],
"metadata": {
"id": "BEwY2u2PKrvR"
}
},
{
"cell_type": "code",
"source": [
"model = make_pipeline(StandardScaler(), RidgeCV())\n",
"\n",
"model.fit(X_train, y_train)\n",
"\n",
"print(f'model score on training data: {model.score(X_train, y_train)}')\n",
"print(f'model score on testing data: {model.score(X_test, y_test)}')"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "hk5ebi7xKlJ0",
"outputId": "e3182ee2-6c9c-4ddf-b935-a545c423df65"
},
"execution_count": 65,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"model score on training data: 0.6048511948222112\n",
"model score on testing data: 0.5863381274564599\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"coefs = pd.DataFrame(\n",
" model[1].coef_,\n",
" columns=['Coefficients'], index=X_train.columns\n",
")\n",
"\n",
"coefs.plot(kind='barh', figsize=(9, 7))\n",
"plt.title('Ridge model')\n",
"plt.axvline(x=0, color='.5')\n",
"plt.subplots_adjust(left=.3)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 444
},
"id": "8KjhPtSzLP3P",
"outputId": "9288292a-4415-4585-c361-4c75afa51828"
},
"execution_count": 66,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"
"
],
"image/png": "\n"
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"source": [
"Now that the coefficients have been scaled, we can safely compare them. The median income feature, with longitude and latitude are the three variables that most influence the model.\n",
"\n",
"The plot above tells us about dependencies between a specific feature and the target when all other features remain constant, i.e., conditional dependencies. An increase of the `HouseAge` will induce an increase of the price when all other features remain constant. On the contrary, an increase of the average rooms will induce an decrease of the price when all other features remain constant."
],
"metadata": {
"id": "NehlVHi2LWHY"
}
},
{
"cell_type": "markdown",
"source": [
"We can check the coefficient variability through cross-validation: it is a form of data perturbation."
],
"metadata": {
"id": "ux2-205wLmSh"
}
},
{
"cell_type": "code",
"source": [
"cv_model = cross_validate(\n",
" model, X_with_rnd_feat, y, cv=RepeatedKFold(n_splits=5, n_repeats=5),\n",
" return_estimator=True, n_jobs=2\n",
")\n",
"coefs = pd.DataFrame(\n",
" [model[1].coef_\n",
" for model in cv_model['estimator']],\n",
" columns=X_with_rnd_feat.columns\n",
")\n",
"plt.figure(figsize=(9, 7))\n",
"sns.boxplot(data=coefs, orient='h', color='cyan', saturation=0.5)\n",
"plt.axvline(x=0, color='.5')\n",
"plt.xlabel('Coefficient importance')\n",
"plt.title('Coefficient importance and its variability')\n",
"plt.subplots_adjust(left=.3)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 458
},
"id": "deiQJiPNLR1_",
"outputId": "e4238395-ff2a-4a94-a018-ba1fea30d4e4"
},
"execution_count": 64,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"
"
],
"image/png": "\n"
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"source": [
"Now if we want to select the four features which are the most important according to the coefficients. The `SelectFromModel` is meant just for that. `SelectFromModel` accepts a `threshold` parameter and will select the features whose importance (defined by the coefficients) are above this threshold."
],
"metadata": {
"id": "wL8szAkwMLto"
}
},
{
"cell_type": "code",
"source": [
"model"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "T8C_GOhgM1L4",
"outputId": "6512cc7e-be65-4d42-ee28-52c26e00aadb"
},
"execution_count": 67,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Pipeline(steps=[('standardscaler', StandardScaler()),\n",
" ('ridgecv', RidgeCV(alphas=array([ 0.1, 1. , 10. ])))])"
]
},
"metadata": {},
"execution_count": 67
}
]
},
{
"cell_type": "code",
"source": [
"importance = np.abs(model[1].coef_)\n",
"threshold = np.sort(importance)[-5] + 0.01"
],
"metadata": {
"id": "JrrrUqq4Lo2S"
},
"execution_count": 68,
"outputs": []
},
{
"cell_type": "code",
"source": [
"feature_names = np.array(X.columns)\n",
"sfm = SelectFromModel(model[1], threshold=threshold).fit(X, y)\n",
"print(f\"Features selected by SelectFromModel: {feature_names[sfm.get_support()]}\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "nqtT14UoM4OS",
"outputId": "603fb5b4-c3b5-4175-9079-8c8370c9f942"
},
"execution_count": 72,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Features selected by SelectFromModel: ['MedInc' 'AveBedrms' 'Latitude' 'Longitude']\n"
]
}
]
},
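{
"cell_type": "markdown",
"source": [
"Tuning the threshold by hand works, but `SelectFromModel` also has a `max_features` parameter. Combined with `threshold=-np.inf`, it keeps exactly the top-k features ranked by absolute coefficient. The sketch below assumes synthetic data from `make_regression` (not the housing data) and a plain `Ridge` with an arbitrary `alpha`; the features are standardized first so that the coefficients are comparable."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"from sklearn.datasets import make_regression\n",
"from sklearn.feature_selection import SelectFromModel\n",
"from sklearn.linear_model import Ridge\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Synthetic regression problem: 8 features, 4 of them informative\n",
"X_demo, y_demo = make_regression(\n",
"    n_samples=500, n_features=8, n_informative=4, noise=5.0, random_state=0\n",
")\n",
"X_demo = StandardScaler().fit_transform(X_demo)\n",
"\n",
"# threshold=-np.inf disables the threshold, so only max_features limits the selection\n",
"selector = SelectFromModel(Ridge(alpha=1.0), threshold=-np.inf, max_features=4)\n",
"selector.fit(X_demo, y_demo)\n",
"mask = selector.get_support()\n",
"print('selected column indices:', np.flatnonzero(mask))"
],
"metadata": {},
"execution_count": null,
"outputs": []
},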
{
"cell_type": "markdown",
"source": [
"#### Linear models with sparse coefficients (Lasso)"
],
"metadata": {
"id": "25OIEBwCNZUw"
}
},
{
"cell_type": "markdown",
"source": [
"In it important to keep in mind that the associations extracted depend on the model. To illustrate this point we consider a Lasso model, that performs feature selection with a L1 penalty. Let us fit a Lasso model with a strong regularization parameters alpha"
],
"metadata": {
"id": "vHsNnqoQNhsT"
}
},
{
"cell_type": "code",
"source": [
"model = make_pipeline(StandardScaler(), Lasso(alpha=.015))\n",
"\n",
"model.fit(X_train, y_train)\n",
"\n",
"print(f'model score on training data: {model.score(X_train, y_train)}')\n",
"print(f'model score on testing data: {model.score(X_test, y_test)}')"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "LWLjWFkmM9LL",
"outputId": "e3de2fb1-32ed-4ab2-bb9d-8fd673368fb7"
},
"execution_count": 73,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"model score on training data: 0.5933235371761756\n",
"model score on testing data: 0.5673786563118284\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"coefs = pd.DataFrame(\n",
" model[1].coef_,\n",
" columns=['Coefficients'], index=X_train.columns\n",
")\n",
"\n",
"coefs.plot(kind='barh', figsize=(9, 7))\n",
"plt.title('Lasso model, strong regularization')\n",
"plt.axvline(x=0, color='.5')\n",
"plt.subplots_adjust(left=.3)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 444
},
"id": "LTDzbcc6NlYv",
"outputId": "3f8a1cd1-2eab-4560-c2b2-394f0a196389"
},
"execution_count": 74,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"
"
],
"image/png": "\n"
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"source": [
"Here the model score is a bit lower, because of the strong regularization. However, it has zeroed out 3 coefficients, selecting a small number of variables to make its prediction."
],
"metadata": {
"id": "eCstXIgINwgh"
}
},
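{
"cell_type": "markdown",
"source": [
"The contrast with ridge can be checked directly. The following sketch assumes synthetic data from `make_regression` with only 3 informative features out of 10, and `alpha` values chosen arbitrarily for that target scale. It counts how many coefficients each model sets exactly to zero: the L1 penalty produces exact zeros, while the L2 penalty only shrinks coefficients."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"from sklearn.datasets import make_regression\n",
"from sklearn.linear_model import Lasso, Ridge\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"X_demo, y_demo = make_regression(\n",
"    n_samples=500, n_features=10, n_informative=3, noise=10.0, random_state=0\n",
")\n",
"X_demo = StandardScaler().fit_transform(X_demo)\n",
"\n",
"ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)\n",
"lasso_demo = Lasso(alpha=5.0).fit(X_demo, y_demo)\n",
"\n",
"# Count coefficients that are exactly zero in each model\n",
"n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))\n",
"n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))\n",
"print('ridge zero coefficients:', n_zero_ridge)\n",
"print('lasso zero coefficients:', n_zero_lasso)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},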
{
"cell_type": "markdown",
"source": [
"#### Randomforest with feature importance"
],
"metadata": {
"id": "9xppGbQ4N1Zw"
}
},
{
"cell_type": "markdown",
"source": [
"On some algorithms, there are some feature importance methods, inherently built within the model. It is the case in RandomForest models. Let’s investigate the built-in feature_importances_ attribute."
],
"metadata": {
"id": "EZi4Yk_hN935"
}
},
{
"cell_type": "code",
"source": [
"model = RandomForestRegressor()\n",
"\n",
"model.fit(X_train, y_train)\n",
"\n",
"print(f'model score on training data: {model.score(X_train, y_train)}')\n",
"print(f'model score on testing data: {model.score(X_test, y_test)}')"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "0BZSUHpLNn5X",
"outputId": "4b1440c3-908d-4d09-9138-71862efa6378"
},
"execution_count": 75,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"model score on training data: 0.9796271614609334\n",
"model score on testing data: 0.8457060700865664\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"importances = model.feature_importances_\n",
"indices = np.argsort(importances)\n",
"\n",
"fig, ax = plt.subplots()\n",
"ax.barh(range(len(importances)), importances[indices])\n",
"ax.set_yticks(range(len(importances)))\n",
"_ = ax.set_yticklabels(np.array(X_train.columns)[indices])"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 265
},
"id": "xBAIYWWFOCcy",
"outputId": "dd7b532f-70d7-4f43-8958-2427e1cc7ff7"
},
"execution_count": 76,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"
"
],
"image/png": "\n"
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"source": [
"Median income is still the most important feature. It also has a small bias toward high cardinality features, such as the noisy feature `rnd_num`, which are here predicted having `0.07` importance, more than `HouseAge` (which has low cardinality)."
],
"metadata": {
"id": "XbR4LmgUOLUA"
}
},
{
"cell_type": "markdown",
"source": [
"#### Feature importance by permutation"
],
"metadata": {
"id": "NiOjvqYDOVhR"
}
},
{
"cell_type": "markdown",
"source": [
"We introduce here a new technique to evaluate the feature importance of any given fitted model. It basically shuffles a feature and sees how the model changes its prediction. Thus, the change in prediction will correspond to the feature importance."
],
"metadata": {
"id": "HGPNltcJOYlo"
}
},
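{
"cell_type": "markdown",
"source": [
"The idea fits in a few lines of NumPy. The sketch below is a hand-rolled illustration, not scikit-learn's implementation: it assumes synthetic data in which only the first two columns drive the target, uses ordinary least squares as a stand-in model, and reports the mean drop in $R^2$ over 10 shuffles of each column on held-out data."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"n = 2000\n",
"X_toy = rng.normal(size=(n, 3))\n",
"# Only the first two columns drive the target; the third is pure noise\n",
"y_toy = 3.0 * X_toy[:, 0] + 1.0 * X_toy[:, 1] + rng.normal(0, 0.5, size=n)\n",
"\n",
"X_fit, X_val = X_toy[:1000], X_toy[1000:]\n",
"y_fit, y_val = y_toy[:1000], y_toy[1000:]\n",
"\n",
"# Stand-in model: ordinary least squares (any fitted predictor would do)\n",
"A = np.column_stack([np.ones(len(X_fit)), X_fit])\n",
"beta, *_ = np.linalg.lstsq(A, y_fit, rcond=None)\n",
"\n",
"def predict(X):\n",
"    return beta[0] + X @ beta[1:]\n",
"\n",
"def r2(y_true, y_pred):\n",
"    ss_res = np.sum((y_true - y_pred) ** 2)\n",
"    ss_tot = np.sum((y_true - y_true.mean()) ** 2)\n",
"    return 1.0 - ss_res / ss_tot\n",
"\n",
"baseline = r2(y_val, predict(X_val))\n",
"drops = []\n",
"for j in range(X_val.shape[1]):\n",
"    scores = []\n",
"    for _ in range(10):\n",
"        X_perm = X_val.copy()\n",
"        X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link with y\n",
"        scores.append(r2(y_val, predict(X_perm)))\n",
"    drops.append(baseline - np.mean(scores))\n",
"\n",
"print('baseline R^2:', baseline)\n",
"print('mean score drop per feature:', drops)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},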
{
"cell_type": "code",
"source": [
"# Any model could be used here\n",
"\n",
"\n",
"model = RandomForestRegressor()\n",
"model.fit(X_train, y_train)\n",
"\n",
"print(f'model score on training data: {model.score(X_train, y_train)}')\n",
"print(f'model score on testing data: {model.score(X_test, y_test)}')"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "W20bLMRSOJJq",
"outputId": "478b016a-fc0f-4e4a-93a5-d1ff38504db5"
},
"execution_count": 77,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"model score on training data: 0.9795237577232964\n",
"model score on testing data: 0.8467958072484991\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"r = permutation_importance(model, X_test, y_test, n_repeats=30, random_state=42)"
],
"metadata": {
"id": "5J9M7E1vOmMN"
},
"execution_count": 79,
"outputs": []
},
{
"cell_type": "code",
"source": [
"fig, ax = plt.subplots()\n",
"\n",
"indices = r.importances_mean.argsort()\n",
"plt.barh(range(len(indices)), r.importances_mean[indices], xerr=r.importances_std[indices])\n",
"\n",
"ax.set_yticks(range(len(indices)))\n",
"_ = ax.set_yticklabels(np.array(X_train.columns)[indices])"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 265
},
"id": "gNU5FFTsPToC",
"outputId": "098861ca-aedf-4d99-902a-c9cb205b0eaa"
},
"execution_count": 80,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"
"
],
"image/png": "\n"
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"source": [
"We see again that the feature `MedInc`, Latitude and Longitude are very important for the model. We note that our random variable `rnd_num` is now very less important than latitude. Indeed, the feature importance built-in in `RandomForest` has bias for continuous data, such as `AveOccup` and `rnd_num`."
],
"metadata": {
"id": "ZSzzW-1OQL0Y"
}
},
{
"cell_type": "markdown",
"source": [
"#### Feature rejection using Boruta"
],
"metadata": {
"id": "I1T6QCk1QpF_"
}
},
{
"cell_type": "code",
"source": [
"# define Boruta feature selection method\n",
"feat_selector = BorutaPy(model, n_estimators='auto', verbose=2, random_state=1)"
],
"metadata": {
"id": "El1azRgZQohX"
},
"execution_count": 85,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# find all relevant features \n",
"feat_selector.fit(X_train.values, y_train.values)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "zlVL9xjvP3de",
"outputId": "ba1dfec2-31cd-4947-807d-820b967ebdba"
},
"execution_count": 89,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Iteration: \t1 / 100\n",
"Confirmed: \t0\n",
"Tentative: \t10\n",
"Rejected: \t0\n",
"Iteration: \t2 / 100\n",
"Confirmed: \t0\n",
"Tentative: \t10\n",
"Rejected: \t0\n",
"Iteration: \t3 / 100\n",
"Confirmed: \t0\n",
"Tentative: \t10\n",
"Rejected: \t0\n",
"Iteration: \t4 / 100\n",
"Confirmed: \t0\n",
"Tentative: \t10\n",
"Rejected: \t0\n",
"Iteration: \t5 / 100\n",
"Confirmed: \t0\n",
"Tentative: \t10\n",
"Rejected: \t0\n",
"Iteration: \t6 / 100\n",
"Confirmed: \t0\n",
"Tentative: \t10\n",
"Rejected: \t0\n",
"Iteration: \t7 / 100\n",
"Confirmed: \t0\n",
"Tentative: \t10\n",
"Rejected: \t0\n",
"Iteration: \t8 / 100\n",
"Confirmed: \t9\n",
"Tentative: \t0\n",
"Rejected: \t1\n",
"\n",
"\n",
"BorutaPy finished running.\n",
"\n",
"Iteration: \t9 / 100\n",
"Confirmed: \t9\n",
"Tentative: \t0\n",
"Rejected: \t1\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"BorutaPy(estimator=RandomForestRegressor(n_estimators=44,\n",
" random_state=RandomState(MT19937) at 0x7F0639E28E20),\n",
" n_estimators='auto',\n",
" random_state=RandomState(MT19937) at 0x7F0639E28E20, verbose=2)"
]
},
"metadata": {},
"execution_count": 89
}
]
},
{
"cell_type": "code",
"source": [
"# check selected features \n",
"np.array(X_train.columns)[feat_selector.support_]"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "V5aQGB1cRWYY",
"outputId": "fdfd1412-4330-41fb-8288-558ac0186f1b"
},
"execution_count": 92,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',\n",
" 'AveOccup', 'Latitude', 'Longitude', 'rnd_num'], dtype=object)"
]
},
"metadata": {},
"execution_count": 92
}
]
},
{
"cell_type": "code",
"source": [
"# check ranking of features\n",
"feat_selector.ranking_"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "WaQ2itMTRX99",
"outputId": "3b0688dc-ff55-4104-d169-154a8a71a7d8"
},
"execution_count": 91,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([1, 1, 1, 1, 1, 1, 1, 1, 2, 1])"
]
},
"metadata": {},
"execution_count": 91
}
]
},
{
"cell_type": "code",
"source": [
"# call transform() on X to filter it down to selected features\n",
"X_filtered = feat_selector.transform(X_train.values)\n",
"X_filtered.shape"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "UGaMazEIRb74",
"outputId": "1c97eacc-b0c8-465f-a603-9f526d9f559b"
},
"execution_count": 95,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(7500, 9)"
]
},
"metadata": {},
"execution_count": 95
}
]
},
{
"cell_type": "markdown",
"source": [
"## Dimensional reduction"
],
"metadata": {
"id": "tILVdZOoUA8f"
}
},
{
"cell_type": "markdown",
"source": [
"We now looked at our model-based method for feature engineering: principal component analysis (PCA). You could think of PCA as a partitioning of the variation in the data. PCA is a great tool to help you discover important relationships in the data and can also be used to create more informative features."
],
"metadata": {
"id": "GOuAQ5oUUPTP"
}
},
{
"cell_type": "markdown",
"source": [
"There are two ways you could use PCA for feature engineering.\n",
"\n",
"The first way is to use it as a descriptive technique. Since the components tell you about the variation, **you could compute the MI scores for the components and see what kind of variation is most predictive of your target.** That could give you ideas for kinds of features to create -- a product of `'Height'` and `'Diameter'` if `'Size'` is important, say, or a ratio of `'Height'` and `'Diameter'` if `Shape` is important. You could even try clustering on one or more of the high-scoring components.\n",
"\n",
"The second way is to use the components themselves as features. Because the components expose the variational structure of the data directly, **they can often be more informative than the original features.** Here are some use-cases:\n",
"- **Dimensionality reduction**: When your features are highly redundant (*multicollinear*, specifically), PCA will partition out the redundancy into one or more near-zero variance components, which you can then drop since they will contain little or no information.\n",
"- **Anomaly detection**: Unusual variation, not apparent from the original features, will often show up in the low-variance components. These components could be highly informative in an anomaly or outlier detection task.\n",
"- **Noise reduction**: A collection of sensor readings will often share some common background noise. PCA can sometimes collect the (informative) signal into a smaller number of features while leaving the noise alone, thus boosting the signal-to-noise ratio.\n",
"- **Decorrelation**: Some ML algorithms struggle with highly-correlated features. PCA transforms correlated features into uncorrelated components, which could be easier for your algorithm to work with."
],
"metadata": {
"id": "x1CYu0lXU4vG"
}
},
{
"cell_type": "markdown",
"source": [
"PCA basically gives you direct access to the correlational structure of your data. You'll no doubt come up with applications of your own!"
],
"metadata": {
"id": "FBYABylCVU9_"
}
},
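{
"cell_type": "markdown",
"source": [
"As a small illustration of the dimensionality reduction and decorrelation use-cases, the sketch below assumes three synthetic features, two of which are nearly redundant copies of the same signal. PCA should push the redundancy into a component with near-zero variance, and the covariance matrix of the components should be diagonal."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"from sklearn.decomposition import PCA\n",
"\n",
"rng = np.random.default_rng(0)\n",
"n = 500\n",
"base = rng.normal(size=n)\n",
"# Two nearly redundant features plus one independent feature (all synthetic)\n",
"X_toy = np.column_stack([\n",
"    base + 0.05 * rng.normal(size=n),\n",
"    base + 0.05 * rng.normal(size=n),\n",
"    rng.normal(size=n),\n",
"])\n",
"X_toy = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)\n",
"\n",
"pca_toy = PCA().fit(X_toy)\n",
"Z = pca_toy.transform(X_toy)\n",
"\n",
"print('explained variance ratio:', pca_toy.explained_variance_ratio_.round(3))\n",
"print('correlation of the redundant pair:', np.corrcoef(X_toy[:, 0], X_toy[:, 1])[0, 1].round(3))\n",
"# Off-diagonal entries are (numerically) zero: the components are uncorrelated\n",
"C = np.cov(Z, rowvar=False)\n",
"print(C.round(3))"
],
"metadata": {},
"execution_count": null,
"outputs": []
},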
{
"cell_type": "code",
"source": [
"def plot_variance(pca, width=8, dpi=100):\n",
" # Create figure\n",
" fig, axs = plt.subplots(1, 2)\n",
" n = pca.n_components_\n",
" grid = np.arange(1, n + 1)\n",
" # Explained variance\n",
" evr = pca.explained_variance_ratio_\n",
" axs[0].bar(grid, evr)\n",
" axs[0].set(\n",
" xlabel=\"Component\", title=\"% Explained Variance\", ylim=(0.0, 1.0)\n",
" )\n",
" # Cumulative Variance\n",
" cv = np.cumsum(evr)\n",
" axs[1].plot(np.r_[0, grid], np.r_[0, cv], \"o-\")\n",
" axs[1].set(\n",
" xlabel=\"Component\", title=\"% Cumulative Variance\", ylim=(0.0, 1.0)\n",
" )\n",
" # Set up figure\n",
" fig.set(figwidth=8, dpi=100)\n",
" return axs"
],
"metadata": {
"id": "y4WY_DqCUEsx"
},
"execution_count": 97,
"outputs": []
},
{
"cell_type": "code",
"source": [
"df = pd.read_csv(\"autos.csv\")\n",
"df.head()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 352
},
"id": "KY-9l7p1VtRM",
"outputId": "d3ba83cc-7903-4504-e080-6393ab5e91a5"
},
"execution_count": 98,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" symboling make fuel_type aspiration num_of_doors body_style \\\n",
"0 3 alfa-romero gas std 2 convertible \n",
"1 3 alfa-romero gas std 2 convertible \n",
"2 1 alfa-romero gas std 2 hatchback \n",
"3 2 audi gas std 4 sedan \n",
"4 2 audi gas std 4 sedan \n",
"\n",
" drive_wheels engine_location wheel_base length ... engine_size \\\n",
"0 rwd front 88.6 168.8 ... 130 \n",
"1 rwd front 88.6 168.8 ... 130 \n",
"2 rwd front 94.5 171.2 ... 152 \n",
"3 fwd front 99.8 176.6 ... 109 \n",
"4 4wd front 99.4 176.6 ... 136 \n",
"\n",
" fuel_system bore stroke compression_ratio horsepower peak_rpm city_mpg \\\n",
"0 mpfi 3.47 2.68 9 111 5000 21 \n",
"1 mpfi 3.47 2.68 9 111 5000 21 \n",
"2 mpfi 2.68 3.47 9 154 5000 19 \n",
"3 mpfi 3.19 3.40 10 102 5500 24 \n",
"4 mpfi 3.19 3.40 8 115 5500 18 \n",
"\n",
" highway_mpg price \n",
"0 27 13495 \n",
"1 27 16500 \n",
"2 26 16500 \n",
"3 30 13950 \n",
"4 22 17450 \n",
"\n",
"[5 rows x 25 columns]"
]
},
"metadata": {},
"execution_count": 98
}
]
},
{
"cell_type": "markdown",
"source": [
"We've selected four features that cover a range of properties. Each of these features also has a high MI score with the target, `price`. We'll standardize the data since these features aren't naturally on the same scale."
],
"metadata": {
"id": "PFIztg2qV1oA"
}
},
{
"cell_type": "code",
"source": [
"features = [\"highway_mpg\", \"engine_size\", \"horsepower\", \"curb_weight\"]\n",
"\n",
"X = df.copy()\n",
"y = X.pop('price')\n",
"X = X.loc[:, features]\n",
"\n",
"# Standardize\n",
"X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)"
],
"metadata": {
"id": "idZp66RfVwXt"
},
"execution_count": 99,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Now we can fit scikit-learn's `PCA` estimator and create the principal components. You can see here the first few rows of the transformed dataset."
],
"metadata": {
"id": "c3wTqeTBV4tg"
}
},
{
"cell_type": "code",
"source": [
"# Create principal components\n",
"pca = PCA()\n",
"X_pca = pca.fit_transform(X_scaled)\n",
"\n",
"# Convert to dataframe\n",
"component_names = [f\"PC{i+1}\" for i in range(X_pca.shape[1])]\n",
"X_pca = pd.DataFrame(X_pca, columns=component_names)\n",
"\n",
"X_pca.head()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "6JTcIZ7DV3J-",
"outputId": "c1e85fe9-e901-4a0a-9fc4-7786feb38168"
},
"execution_count": 100,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" PC1 PC2 PC3 PC4\n",
"0 0.382486 -0.400222 0.124122 0.169539\n",
"1 0.382486 -0.400222 0.124122 0.169539\n",
"2 1.550890 -0.107175 0.598361 -0.256081\n",
"3 -0.408859 -0.425947 0.243335 0.013920\n",
"4 1.132749 -0.814565 -0.202885 0.224138"
]
},
"metadata": {},
"execution_count": 100
}
]
},
{
"cell_type": "markdown",
"source": [
"After fitting, the `PCA` instance contains the loadings in its `components_` attribute. We'll wrap the loadings up in a dataframe."
],
"metadata": {
"id": "UEnzS398V-Cv"
}
},
{
"cell_type": "code",
"source": [
"loadings = pd.DataFrame(\n",
" pca.components_.T, # transpose the matrix of loadings\n",
" columns=component_names, # so the columns are the principal components\n",
" index=X.columns, # and the rows are the original features\n",
")\n",
"loadings"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 175
},
"id": "78T7AguDV7aO",
"outputId": "c2d11267-3440-4edb-9c55-1a8fa5837784"
},
"execution_count": 101,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" PC1 PC2 PC3 PC4\n",
"highway_mpg -0.492347 0.770892 0.070142 -0.397996\n",
"engine_size 0.503859 0.626709 0.019960 0.594107\n",
"horsepower 0.500448 0.013788 0.731093 -0.463534\n",
"curb_weight 0.503262 0.113008 -0.678369 -0.523232"
]
},
"metadata": {},
"execution_count": 101
}
]
},
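{
"cell_type": "markdown",
"source": [
"As a quick check of what the loadings mean, the sketch below assumes a small synthetic dataset with correlated columns (not the autos data). Each principal component is a weighted sum of the centered features, with the weights given by that component's column of loadings, so multiplying the data by the loadings matrix should reproduce the output of `transform`."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"from sklearn.decomposition import PCA\n",
"\n",
"rng = np.random.default_rng(1)\n",
"# Mix independent columns to obtain correlated synthetic features\n",
"X_toy = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))\n",
"X_toy = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)\n",
"\n",
"pca_toy = PCA().fit(X_toy)\n",
"loadings_toy = pca_toy.components_.T  # rows: original features, columns: components\n",
"\n",
"# Weighted sums of the centered features reproduce the component scores\n",
"scores_manual = X_toy @ loadings_toy\n",
"print(np.allclose(scores_manual, pca_toy.transform(X_toy)))\n",
"# Each column of loadings has unit length\n",
"print(np.linalg.norm(loadings_toy, axis=0))"
],
"metadata": {},
"execution_count": null,
"outputs": []
},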
{
"cell_type": "markdown",
"source": [
"Recall that the signs and magnitudes of a component's loadings tell us what kind of variation it's captured. The first component (`PC1`) shows a contrast between large, powerful vehicles with poor gas milage, and smaller, more economical vehicles with good gas milage. We might call this the \"Luxury/Economy\" axis. The next figure shows that our four chosen features mostly vary along the Luxury/Economy axis."
],
"metadata": {
"id": "oAx9iIwKWKrv"
}
},
{
"cell_type": "code",
"source": [
"# Look at explained variance\n",
"plot_variance(pca)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 456
},
"id": "rEHb4338WIUA",
"outputId": "d8d16d16-af1c-45a9-f70d-09b382af4ff5"
},
"execution_count": 102,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([,\n",
" ],\n",
" dtype=object)"
]
},
"metadata": {},
"execution_count": 102
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"
"
],
"image/png": "\n"
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"source": [
"Let's also look at the MI scores of the components. Not surprisingly, `PC1` is highly informative, though the remaining components, despite their small variance, still have a significant relationship with `price`. Examining those components could be worthwhile to find relationships not captured by the main Luxury/Economy axis."
],
"metadata": {
"id": "tEdlKVpvWRo2"
}
},
{
"cell_type": "code",
"source": [
"mi_scores = make_mi_scores(X_pca, y, discrete_features=False)\n",
"mi_scores"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "L_EhGRNuWPfg",
"outputId": "ba784c35-4b6b-42d9-e4a9-7d3a0b052ed8"
},
"execution_count": 103,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"PC1 1.013190\n",
"PC2 0.379271\n",
"PC3 0.306780\n",
"PC4 0.204163\n",
"Name: MI Scores, dtype: float64"
]
},
"metadata": {},
"execution_count": 103
}
]
},
{
"cell_type": "markdown",
"source": [
"The third component shows a contrast between `horsepower` and `curb_weight` -- sports cars vs. wagons, it seems."
],
"metadata": {
"id": "XX88JTAsWY9f"
}
},
{
"cell_type": "code",
"source": [
"# Show dataframe sorted by PC3\n",
"idx = X_pca[\"PC3\"].sort_values(ascending=False).index\n",
"cols = [\"make\", \"body_style\", \"horsepower\", \"curb_weight\"]\n",
"df.loc[idx, cols]"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 424
},
"id": "NhtyKfvHWUNY",
"outputId": "6ddde196-0e17-46cd-962c-e0fb0db78116"
},
"execution_count": 104,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" make body_style horsepower curb_weight\n",
"118 porsche hardtop 207 2756\n",
"117 porsche hardtop 207 2756\n",
"119 porsche convertible 207 2800\n",
"45 jaguar sedan 262 3950\n",
"96 nissan hatchback 200 3139\n",
".. ... ... ... ...\n",
"59 mercedes-benz wagon 123 3750\n",
"61 mercedes-benz sedan 123 3770\n",
"101 peugot wagon 95 3430\n",
"105 peugot wagon 95 3485\n",
"143 toyota wagon 62 3110\n",
"\n",
"[193 rows x 4 columns]"
],
"text/html": [
"\n",
"
"
],
"image/png": "\n"
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"source": [
"#### UMAP"
],
"metadata": {
"id": "gJIG1-zcacTO"
}
},
{
"cell_type": "markdown",
"source": [
"UMAP is useful for generating visualisations, but if you want to make use of UMAP more generally for machine learning tasks it is important to be be able to train a model and then later pass new data to the model and have it transform that data into the learned space. For example if we use UMAP to learn a latent space and then train a classifier on data transformed into the latent space then the classifier is only useful for prediction if we can transform data for which we want a prediction into the latent space the classifier uses. "
],
"metadata": {
"id": "RFco1ZdqawkV"
}
},
{
"cell_type": "code",
"source": [
"X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, stratify=digits.target, random_state=42)"
],
"metadata": {
"id": "Xd89JSOraYKk"
},
"execution_count": 117,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Now to get a benchmark idea of what we are looking at let’s train a couple of different classifiers and then see how well they score on the test set. For this example let’s try a support vector classifier and a KNN classifier."
],
"metadata": {
"id": "4MoEzfIIbGh_"
}
},
{
"cell_type": "code",
"source": [
"svc = SVC(gamma='auto').fit(X_train, y_train)\n",
"knn = KNeighborsClassifier().fit(X_train, y_train)\n",
"svc.score(X_test, y_test), knn.score(X_test, y_test)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Iv6WyDIRa__H",
"outputId": "817ceab4-11cd-44b4-de28-91f2a6d38c57"
},
"execution_count": 121,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(0.62, 0.9844444444444445)"
]
},
"metadata": {},
"execution_count": 121
}
]
},
{
"cell_type": "markdown",
"source": [
"The goal now is to make use of UMAP as a preprocessing step that one could potentially fit into a pipeline. "
],
"metadata": {
"id": "Gx02oiB3b8nV"
}
},
{
"cell_type": "code",
"source": [
"trans = umap.UMAP(n_neighbors=5, random_state=42).fit(X_train)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "SU4gnbqrbM9h",
"outputId": "47e1956e-22ad-42cc-8906-69725e5f773f"
},
"execution_count": 122,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"/usr/local/lib/python3.7/dist-packages/numba/np/ufunc/parallel.py:363: NumbaWarning: The TBB threading layer requires TBB version 2019.5 or later i.e., TBB_INTERFACE_VERSION >= 11005. Found TBB_INTERFACE_VERSION = 9107. The TBB threading layer is disabled.\n",
" warnings.warn(problem)\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"plt.figure(figsize=(10, 8))\n",
"plt.scatter(trans.embedding_[:, 0], trans.embedding_[:, 1], c=y_train, cmap='Spectral', s=5)\n",
"plt.gca().set_aspect('equal', 'datalim')\n",
"plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))\n",
"plt.title('Umap of the Digits dataset', fontsize=24)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 525
},
"id": "vKQdE0_LcFMG",
"outputId": "6e5d02a2-03c0-47e0-9291-829be88b7baf"
},
"execution_count": 125,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Text(0.5, 1.0, 'Umap of the Digits dataset')"
]
},
"metadata": {},
"execution_count": 125
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"
"
],
"image/png": "\n"
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"source": [
"This looks very promising! Most of the classes got very cleanly separated, and that gives us some hope that it could help with classifier performance. We can now train some new models (again an SVC and a KNN classifier) on the embedded training data. This looks exactly as before but now we pass it the embedded data. "
],
"metadata": {
"id": "NVGIGUqCcY-r"
}
},
{
"cell_type": "code",
"source": [
"svc = SVC(gamma='auto').fit(trans.embedding_, y_train)\n",
"knn = KNeighborsClassifier().fit(trans.embedding_, y_train)"
],
"metadata": {
"id": "vaiBl0iycI3W"
},
"execution_count": 131,
"outputs": []
},
{
"cell_type": "code",
"source": [
"test_embedding = trans.transform(X_test)"
],
"metadata": {
"id": "g1_0phAhchBB"
},
"execution_count": 128,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The next important question is what the transform did to our test data. In principle we have a new two dimensional representation of the test-set, and ideally this should be based on the existing embedding of the training set"
],
"metadata": {
"id": "yqqhDql-c0M9"
}
},
{
"cell_type": "code",
"source": [
"plt.figure(figsize=(10, 8))\n",
"plt.scatter(test_embedding[:, 0], test_embedding[:, 1], c=y_test, cmap='Spectral', s=5)\n",
"plt.gca().set_aspect('equal', 'datalim')\n",
"plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))\n",
"plt.title('Umap of the Digits dataset', fontsize=24)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 525
},
"id": "gblKRyguczF7",
"outputId": "b8edc9f9-21aa-4d60-b5c5-9260820bf94e"
},
"execution_count": 130,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Text(0.5, 1.0, 'Umap of the Digits dataset')"
]
},
"metadata": {},
"execution_count": 130
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"
"
],
"image/png": "\n"
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"source": [
"The results look like what we should expect; the test data has been embedded into two dimensions in exactly the locations we should expect (by class) given the embedding of the training data visualised above. This means we can now try out models that were trained on the embedded training data by handing them the newly transformed test set."
],
"metadata": {
"id": "LI9PGw5CdBFH"
}
},
{
"cell_type": "code",
"source": [
"svc.score(trans.transform(X_test), y_test), knn.score(trans.transform(X_test), y_test)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "__808eXkc38M",
"outputId": "960e5116-027b-4929-de7b-a88e6ef8337c"
},
"execution_count": 132,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(0.9822222222222222, 0.9822222222222222)"
]
},
"metadata": {},
"execution_count": 132
}
]
},
{
"cell_type": "markdown",
"source": [
"The results are pretty good. While the accuracy of the KNN classifier did not improve there was not a lot of scope for improvement given the data. On the other hand the SVC has improved to have equal accuracy to the KNN classifier!\n",
"\n",
"For more interesting datasets the larger dimensional embedding might have been a significant gain – it is certainly worth exploring as one of the parameters in a grid search across a pipeline that includes UMAP.\n",
"\n"
],
"metadata": {
"id": "1GN4oP95dSGe"
}
},
{
"cell_type": "markdown",
"source": [
"## Clustering"
],
"metadata": {
"id": "KqWbd1rSdync"
}
},
{
"cell_type": "markdown",
"source": [
"When used for feature engineering, we could attempt to discover groups of customers representing a market segment, for instance, or geographic areas that share similar weather patterns. Adding a feature of cluster labels can help machine learning models untangle complicated relationships of space or proximity."
],
"metadata": {
"id": "Gwnv4I6BesNV"
}
},
{
"cell_type": "markdown",
"source": [
"### Cluster Labels as a feature"
],
"metadata": {
"id": "WErqxr32gLk1"
}
},
{
"cell_type": "markdown",
"source": [
"Applied to a single real-valued feature, clustering acts like a traditional \"binning\" or \"discretization\" transform. On multiple features, it's like \"multi-dimensional binning\" (sometimes called vector quantization)."
],
"metadata": {
"id": "neHcqLiFgQfw"
}
},
{
"cell_type": "markdown",
"source": [
"It's important to remember that this Cluster feature is categorical. Here, it's shown with a label encoding (that is, as a sequence of integers) as a typical clustering algorithm would produce; depending on your model, a one-hot encoding may be more appropriate.\n",
"\n",
"The motivating idea **for adding cluster labels is that the clusters will break up complicated relationships across features into simpler chunks**. Our model can then just learn the simpler chunks one-by-one instead having to learn the complicated whole all at once. It's a \"divide and conquer\" strategy."
],
"metadata": {
"id": "igCdboqEgZGm"
}
},
{
"cell_type": "markdown",
"source": [
"As spatial features, [*California Housing*](https://www.kaggle.com/camnugent/california-housing-prices)'s `'Latitude'` and `'Longitude'` make natural candidates for k-means clustering. In this example we'll cluster these with `'MedInc'` (median income) to create economic segments in different regions of California."
],
"metadata": {
"id": "rxryA26wfL5-"
}
},
{
"cell_type": "markdown",
"source": [
"Since k-means clustering is sensitive to scale, it can be a good idea rescale or normalize data with extreme values. Our features are already roughly on the same scale, so we'll leave them as-is."
],
"metadata": {
"id": "4WNdYtx3fHYl"
}
},
{
"cell_type": "code",
"source": [
"df = fetch_california_housing(as_frame=True)['frame']"
],
"metadata": {
"id": "mfv9WpC_dGs9"
},
"execution_count": 136,
"outputs": []
},
{
"cell_type": "code",
"source": [
"X = df.loc[:, [\"MedInc\", \"Latitude\", \"Longitude\"]]\n",
"X.head()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "n-kpl0jAfTF7",
"outputId": "665d6557-c7d7-452f-956a-d5a9210dc570"
},
"execution_count": 138,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" MedInc Latitude Longitude\n",
"0 8.3252 37.88 -122.23\n",
"1 8.3014 37.86 -122.22\n",
"2 7.2574 37.85 -122.24\n",
"3 5.6431 37.85 -122.25\n",
"4 3.8462 37.85 -122.25"
],
"text/html": [
"\n",
"